Flatbuffers impl #446

Merged
merged 16 commits into from
May 19, 2025

Conversation

boocmp
Collaborator

@boocmp boocmp commented Mar 28, 2025

  1. The NetworkFilterList implementation is now based on FlatBuffers.
  2. The serialization format has been updated to support this change.
  3. Serialization benchmarks have been moved to a separate file.
  4. Support for adding filters after NetworkFilterList creation has been removed. Since rebuilding the entire FlatBuffer is costly, this functionality may be reintroduced in the future if there is a clear need for it.
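A minimal sketch of the build-once design described in the list above, using only the standard library rather than the FlatBuffers crate. `FlatList`, `build`, and `filters_for` are hypothetical names chosen for illustration; they are not the PR's API. The point is only that filters are packed into one contiguous buffer and addressed by `u32` index, so there is deliberately no way to append after construction.

```rust
use std::collections::HashMap;

/// Hypothetical immutable list: every filter lives in one contiguous buffer.
pub struct FlatList {
    storage: Vec<u8>,                 // concatenated raw filter text
    spans: Vec<(u32, u32)>,           // (offset, length) for each filter index
    by_token: HashMap<u64, Vec<u32>>, // token hash -> filter indices
}

impl FlatList {
    /// Builds the whole list in one pass. There is intentionally no
    /// `add_filter`: appending later would mean rebuilding the buffer.
    pub fn build<'a>(filters: impl IntoIterator<Item = (u64, &'a str)>) -> Self {
        let mut storage = Vec::new();
        let mut spans = Vec::new();
        let mut by_token: HashMap<u64, Vec<u32>> = HashMap::new();
        for (token, raw) in filters {
            let index = spans.len() as u32;
            spans.push((storage.len() as u32, raw.len() as u32));
            storage.extend_from_slice(raw.as_bytes());
            by_token.entry(token).or_default().push(index);
        }
        Self { storage, spans, by_token }
    }

    /// Returns the filters registered under `token`, in insertion order.
    pub fn filters_for(&self, token: u64) -> Vec<&str> {
        self.by_token
            .get(&token)
            .into_iter()
            .flatten()
            .map(|&i| {
                let (offset, len) = self.spans[i as usize];
                let bytes = &self.storage[offset as usize..(offset + len) as usize];
                std::str::from_utf8(bytes).expect("storage is built from &str")
            })
            .collect()
    }
}
```

Lookups go from a token hash to a list of integer indices and then into the buffer, which loosely mirrors the `filter_map: HashMap<Hash, Vec<u32>>` field shown later in this review.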

@github-actions github-actions bot left a comment

Rust Benchmark

| Benchmark suite | Current: 537ab54 | Previous: 67b7b70 | Ratio |
|---|---|---|---|
| rule-match-browserlike/brave-list | 2150693604 ns/iter (± 16942629) | 1803081237 ns/iter (± 16195247) | 1.19 |
| rule-match-first-request/brave-list | 1017455 ns/iter (± 36469) | 980016 ns/iter (± 5452) | 1.04 |
| blocker_new/brave-list | 158380700 ns/iter (± 1023726) | 203961371 ns/iter (± 2004867) | 0.78 |
| memory-usage/brave-list-initial | 21536659 ns/iter (± 3) | 41762172 ns/iter (± 3) | 0.52 |
| memory-usage/brave-list-after-1000-requests | 24141128 ns/iter (± 3) | 44355700 ns/iter (± 3) | 0.54 |

This comment was automatically generated by workflow using github-action-benchmark.

@boocmp boocmp force-pushed the flatbuffers_impl branch from 91c5ec7 to cacfcc6 Compare April 9, 2025 00:55
bytes.as_ptr() as *const u16,
bytes.len() / std::mem::size_of::<u16>(),
)
}

reported by reviewdog 🐶
[semgrep] Detected 'unsafe' usage, please audit for secure usage

Source: https://semgrep.dev/r/rust.lang.security.unsafe-usage.unsafe-usage


Cc @thypon @kdenhartog

Collaborator

It turns out that from_raw_parts results in issues if bytes.as_ptr() isn't aligned to 2 bytes. We need to assert this.

Collaborator Author

assert added
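For readers following the alignment point above, here is a minimal sketch of what such an assert could look like around the reviewed cast. The function name `bytes_as_u16_slice` and the extra length check are assumptions for illustration; the PR's actual helper is not shown in this thread.

```rust
/// Reinterprets a byte slice as a slice of `u16` without copying.
/// Hypothetical helper: panics if the pointer is not aligned for `u16`
/// or if the byte length is not a multiple of two.
fn bytes_as_u16_slice(bytes: &[u8]) -> &[u16] {
    assert_eq!(
        bytes.as_ptr() as usize % std::mem::align_of::<u16>(),
        0,
        "byte slice is not aligned for u16"
    );
    assert_eq!(
        bytes.len() % std::mem::size_of::<u16>(),
        0,
        "byte length is not a multiple of size_of::<u16>()"
    );
    // SAFETY: alignment and length were checked above, every bit pattern is
    // a valid `u16`, and the returned slice borrows from `bytes`, so it
    // cannot outlive the underlying memory.
    unsafe {
        std::slice::from_raw_parts(
            bytes.as_ptr() as *const u16,
            bytes.len() / std::mem::size_of::<u16>(),
        )
    }
}
```

Without the alignment check, calling `std::slice::from_raw_parts` with a misaligned pointer is undefined behavior, even on targets where unaligned loads happen to work.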

}

let filters_list =
unsafe { fb::root_as_network_filter_list_unchecked(&self.flatbuffer_memory) };

reported by reviewdog 🐶
[semgrep] Detected 'unsafe' usage, please audit for secure usage

Source: https://semgrep.dev/r/rust.lang.security.unsafe-usage.unsafe-usage


Cc @thypon @kdenhartog

Collaborator

@boocmp use root_as_network_filter_list() + .expect() to remove unsafe

Collaborator Author

It runs a verification process, so it'll degrade performance.

Collaborator

As long as we build the flatbuffer just before using it, root_as_network_filter_list_unchecked should be fine in terms of security.

Collaborator Author

added this check on deserialization
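The trade-off in this thread (verify once when untrusted bytes come in, then use the unchecked accessor on the hot path) can be sketched with a standard-library analogy. This is not FlatBuffers code: UTF-8 validation stands in for the FlatBuffers verifier, and `VerifiedBuffer` with its methods is a hypothetical type invented for the example.

```rust
/// Hypothetical owner of a serialized buffer.
/// Invariant: `bytes` always holds valid UTF-8 (a stand-in for
/// "always holds a buffer that passed verification").
pub struct VerifiedBuffer {
    bytes: Vec<u8>,
}

impl VerifiedBuffer {
    /// Untrusted input path: run the full (slower) verification once.
    pub fn deserialize(bytes: Vec<u8>) -> Result<Self, std::str::Utf8Error> {
        std::str::from_utf8(&bytes)?;
        Ok(Self { bytes })
    }

    /// Trusted path: the data is produced by our own builder, so it is
    /// valid by construction and needs no extra check.
    pub fn build(text: &str) -> Self {
        Self { bytes: text.as_bytes().to_vec() }
    }

    /// Hot path: skips verification on every access.
    pub fn view(&self) -> &str {
        // SAFETY: both constructors guarantee valid UTF-8, and the private
        // field is never mutated afterwards.
        unsafe { std::str::from_utf8_unchecked(&self.bytes) }
    }
}
```

The unchecked call stays sound only because every way of constructing the value upholds the invariant, which is the same argument made above for checking on deserialization and trusting freshly built buffers.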

unsafe {
self._tab
.get::<flatbuffers::ForwardsUOffset<&str>>(NetworkFilter::VT_RAW_LINE, None)
}

reported by reviewdog 🐶
[semgrep] Detected 'unsafe' usage, please audit for secure usage

Source: https://semgrep.dev/r/rust.lang.security.unsafe-usage.unsafe-usage


Cc @thypon @kdenhartog

@boocmp boocmp force-pushed the flatbuffers_impl branch from cacfcc6 to a3df20d Compare April 9, 2025 01:05
@boocmp boocmp marked this pull request as ready for review April 9, 2025 01:38
@boocmp boocmp requested review from atuchin-m and antonok-edm April 9, 2025 01:38
Collaborator

@atuchin-m atuchin-m left a comment

LGTM with a few nits

src/blocker.rs Outdated
Comment on lines 392 to 371
let mut disabled_directives: HashSet<String> = HashSet::new();
let mut enabled_directives: HashSet<String> = HashSet::new();
Collaborator

why not &str anymore?

Collaborator Author

Switched back to &str.

if self.filter_map.is_empty() {
return None;
}

let filters_list =
unsafe { fb::root_as_network_filter_list_unchecked(&self.flatbuffer_memory) };

reported by reviewdog 🐶
[semgrep] Detected 'unsafe' usage, please audit for secure usage

Source: https://semgrep.dev/r/rust.lang.security.unsafe-usage.unsafe-usage


Cc @thypon @kdenhartog


if self.filter_map.is_empty() {
return filters;
}

let filters_list =
unsafe { fb::root_as_network_filter_list_unchecked(&self.flatbuffer_memory) };

reported by reviewdog 🐶
[semgrep] Detected 'unsafe' usage, please audit for secure usage

Source: https://semgrep.dev/r/rust.lang.security.unsafe-usage.unsafe-usage


Cc @thypon @kdenhartog

@boocmp boocmp force-pushed the flatbuffers_impl branch 4 times, most recently from 0238db7 to a40fa5e Compare April 28, 2025 07:50
index: u32,
owner: &'a NetworkFilterList,
) -> Self {
let list_address: *const NetworkFilterList = owner as *const NetworkFilterList;
Collaborator

pointer casting often indicates some problematic design patterns in Rust. I think we can remove it in this instance?

  • this owner field can increase memory usage since it's an additional address-sized field on each filter, even though they'd all have the same value for all filters in a list
  • I only see it used in NetworkMatchable to make it conform to the existing API. I'm not opposed to that API changing if needed.

Collaborator Author

FlatNetworkFilter is a wrapper over fb::NetworkFilter. We only create an instance of it when we find the fb::NetworkFilter in the NetworkFilterList, so it doesn't increase memory usage because it is a temporary object on the stack. For now, the regex manager is designed to use a unique key to find the corresponding regex. Previously that key was the address of the NetworkFilter, but now we don't have a unique address for every filter because they are stored in the flatbuffer. To create a unique key we use the list address combined with the filter's index. I agree it looks ugly, but I don't know how to do it better.

Collaborator

Let's understand why we need this unique key.
If it's only for regex manager, let's:

  1. try to move key calculation here
  2. add a TODO to make keys unique only within the scope of the regex manager. I believe that in one of the next steps we will move from multiple flatbuffers to just one that is able to contain multiple lists. That way we get a 1:1 ratio (regex manager : flatbuffer).

Collaborator

Also, this could be done in a follow-up; this PR has already been waiting for a long time.

Collaborator

1 regex manager: 1 flatbuffer definitely sounds like the right way to do it, but agreed it can be followup if needed
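As a rough illustration of the key scheme discussed in this thread (list address combined with filter index), the following sketch uses hypothetical names `FilterListSketch`, `RegexKey`, and `regex_key`. It assumes the key is only ever compared or hashed and that the address is never dereferenced; the PR's real key layout is not shown here.

```rust
/// Hypothetical stand-in for the list that owns the flat buffer.
pub struct FilterListSketch {
    pub flatbuffer_memory: Vec<u8>,
}

/// Hypothetical cache key: (address of the owning list, filter index).
pub type RegexKey = (usize, u32);

/// Builds a key that is unique per (list, filter) pair for as long as the
/// list stays at the same address.
pub fn regex_key(owner: &FilterListSketch, index: u32) -> RegexKey {
    // The pointer is converted to an integer and used purely as an opaque
    // identity; it is never turned back into a reference.
    (owner as *const FilterListSketch as usize, index)
}
```

One caveat of this approach is that moving the list (for example, returning it by value) changes its address and silently invalidates earlier keys. Scoping keys to a single regex manager per flatbuffer, as suggested above, would let the index alone serve as the key.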

Comment on lines 48 to 62
#[derive(Serialize, Deserialize)]
pub struct NetworkFilterList {
pub(crate) flatbuffer_memory: Vec<u8>,
pub(crate) filter_map: HashMap<Hash, Vec<u32>>,
pub(crate) unique_domains_hashes_map: HashMap<Hash, u16>,
}

impl Default for NetworkFilterList {
fn default() -> Self {
Self {
flatbuffer_memory: Default::default(),
filter_map: Default::default(),
unique_domains_hashes_map: Default::default(),
}
}
Collaborator

#[derive(Serialize, Deserialize, Default)] should automatically produce this Default implementation already

Collaborator Author

For some reason it doesn't, I think because of the new version of serde

Collaborator Author

Oh god, I forgot to add Default to the derive list... I'll fix it in one of the follow-ups.
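A small sketch of the suggested simplification, with the serde derives left out so the example depends only on the standard library. `NetworkFilterListSketch` and the `Hash = u64` alias are stand-ins assumed for illustration; the field names are copied from the snippet under review.

```rust
use std::collections::HashMap;

// Assumption: a plain integer stands in for the crate's `Hash` type.
type Hash = u64;

/// Deriving `Default` works because every field type (`Vec`, `HashMap`)
/// already implements `Default`, so no manual impl is needed.
#[derive(Default)]
pub struct NetworkFilterListSketch {
    pub flatbuffer_memory: Vec<u8>,
    pub filter_map: HashMap<Hash, Vec<u32>>,
    pub unique_domains_hashes_map: HashMap<Hash, u16>,
}
```

The derived implementation produces the same empty buffer and empty maps as the hand-written `impl Default` quoted above.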

@boocmp boocmp requested review from antonok-edm and atuchin-m May 16, 2025 08:00
@boocmp boocmp force-pushed the flatbuffers_impl branch from a541821 to 79b0b83 Compare May 16, 2025 11:38
Collaborator

@antonok-edm antonok-edm left a comment

the other thing left unaddressed is #446 (comment), still not sure if it's required

Comment on lines 11 to 15

use crate::blocker::Blocker;
use crate::cosmetic_filter_cache::CosmeticFilterCache;

use crate::blocker::Blocker;

Collaborator

nit but I don't understand the reason to move Blocker into a separate paragraph by itself (similar change also in src/data_format/v0.rs)

[puLL-Merge] - brave/adblock-rust@446

Description

This PR changes how serialization works in the adblock-rust library by implementing a FlatBuffers-based storage for network filters. The main motivations are to improve performance and reduce memory usage by replacing the old serialization system with a more efficient FlatBuffers approach. The PR also reorganizes the benchmarking structure by moving serialization benchmarks to a separate file.

Changes

  1. Cargo.toml

    • Changed flatbuffers dependency from optional to required
    • Removed flatbuffers-storage feature flag
    • Added a new benchmark for serialization
  2. Benchmarking

    • Moved serialization benchmarks from bench_matching.rs to a new file bench_serialization.rs
    • Updated benchmark code to call serialize() instead of serialize_raw()
  3. Engine API

    • Renamed serialize_raw() to serialize()
    • Updated deserialize() method to handle FlatBuffers format
    • Updated all references to these methods throughout the codebase
  4. FlatBuffers Implementation

    • Added a new module fb_network.rs with FlatBuffers implementation
    • Enhanced the FlatBuffers schema to include more fields
    • Implemented efficient lookups using domain hash maps
  5. Network Filter List

    • Completely refactored the NetworkFilterList to use FlatBuffers storage
    • Changed data structures to use integer indices instead of Arc pointers
    • Implemented optimizations for domain lookups
  6. Data Format

    • Updated serialization/deserialization to handle FlatBuffers format
    • Added proper error handling for FlatBuffers deserialization
  7. Blocker

    • Simplified the blocker code by removing direct filter manipulation methods
    • Changed how CSP directives are handled to be more efficient
  8. Tests

    • Updated tests to work with the new serialization format
    • Removed tests for removed functionality (like add_filter)
sequenceDiagram
    participant App as Application
    participant Engine as Engine
    participant Serializer as Serializer
    participant FB as FlatBuffers
    participant NetworkList as NetworkFilterList
    
    App->>Engine: serialize()
    Engine->>Serializer: SerializeFormat::build()
    Serializer->>FB: create FlatBuffersBuilder
    
    loop For each network filter
        FB->>FB: add filter to buffer
        FB->>NetworkList: track domain hash indices
    end
    
    FB->>Serializer: finish buffer
    Serializer->>Engine: return serialized data
    Engine->>App: return Vec<u8>
    
    App->>Engine: deserialize(data)
    Engine->>FB: parse FlatBuffer data
    FB->>NetworkList: rebuild filter map
    FB->>NetworkList: rebuild domain hash maps
    NetworkList->>Engine: return deserialized structures
    Engine->>App: Ok(())

Possible Issues

  1. The PR removes some public APIs like Blocker::add_filter() and Blocker::optimize() which might break backward compatibility for users who directly manipulate the blocker.

  2. String handling for CSP directives was changed from references to owned strings, which could potentially impact performance if these are frequently accessed.

  3. Test serialization data was completely changed, suggesting that the new format is incompatible with previously serialized data.

@boocmp boocmp force-pushed the flatbuffers_impl branch from 4a1d750 to 537ab54 Compare May 19, 2025 06:23
@boocmp boocmp merged commit 50548c4 into master May 19, 2025
8 checks passed
@boocmp boocmp deleted the flatbuffers_impl branch May 19, 2025 08:23